Abstract: Score normalization is an essential part of a spoken term detection (STD) system. This paper proposes a two-stage score normalization method. First, two features, rank-p and relative-to-max, are introduced into a discriminative score normalization method to produce confidence scores that better separate correct from incorrect candidate words, which makes keyword verification more accurate. Second, a normalization method based on the term-weighted value (TWV) evaluation metric is applied to further improve STD performance. Experimental results show that the proposed method combines the advantages of discriminative and metric-based score normalization and outperforms the best single score normalization method.
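To make the two ingredients named in the abstract more concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes that a rank feature is the position of a candidate in its keyword's score-sorted hit list and that relative-to-max is the candidate's score divided by the best score for the same keyword; the abstract does not define either feature, so both definitions (and the function names) are illustrative assumptions. The TWV function follows the commonly used definition of term-weighted value, TWV = 1 - mean over terms of (P_miss + beta * P_FA), with beta = 999.9 and the number of non-target trials approximated by the speech duration in seconds minus the number of true occurrences.

```python
from collections import defaultdict


def add_rank_and_relative_to_max(candidates):
    """Attach per-keyword rank and relative-to-max features (assumed definitions).

    candidates: list of (keyword, score) pairs with scores in (0, 1].
    Returns a list of dicts in the same order as the input.
    """
    by_keyword = defaultdict(list)
    for idx, (keyword, score) in enumerate(candidates):
        by_keyword[keyword].append((score, idx))

    features = [None] * len(candidates)
    for keyword, items in by_keyword.items():
        items.sort(key=lambda item: -item[0])  # best score first
        top_score = items[0][0]
        for rank, (score, idx) in enumerate(items, start=1):
            features[idx] = {
                "keyword": keyword,
                "score": score,
                "rank": rank,
                "rel_to_max": score / top_score if top_score > 0 else 0.0,
            }
    return features


def term_weighted_value(n_correct, n_spurious, n_true, speech_seconds, beta=999.9):
    """Compute TWV from per-keyword counts at a fixed decision threshold.

    n_correct, n_spurious, n_true: dicts mapping keyword -> count.
    Keywords with no true occurrences are skipped, as P_miss is undefined for them.
    """
    terms = [kw for kw, count in n_true.items() if count > 0]
    penalty = 0.0
    for kw in terms:
        p_miss = 1.0 - n_correct.get(kw, 0) / n_true[kw]
        p_fa = n_spurious.get(kw, 0) / (speech_seconds - n_true[kw])
        penalty += p_miss + beta * p_fa
    return 1.0 - penalty / len(terms)
```

In a two-stage setup such as the one described, features like these would feed a discriminative classifier whose output scores are then normalized with respect to TWV; the classifier and the second-stage normalization are not sketched here because the abstract gives no details about them.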